Lecture #14

* Hash Tables
- The Modulus Operator
- Closed hash tables
- Open hash tables
- Hash table efficiency and “load factor”
- Hashing non-numeric values
- unordered_map: A hash-based STL map class

* (Database) Tables


Big-OH Craziness 


Consider a binary search tree that holds N student 
records, all indexed by their name. 


Each student record contains a linked-list of the L 
classes that they have taken while at UCLA. 


What is the big-oh to determine if a student has taken a class?

bool HasTakenClass(
    BTree &b,
    string &name,
    string &className
)

[Figure: a BST of student records (e.g., Rick, Sal), each with Left/Right pointers and a linked list of class nodes (e.g., Class: EE10, with a Next pointer).]


Hash Tables

[Image caption: “SHOULD HAVE USED A HASH TABLE BUT USED A LINKED LIST”]


Hash Tables 
What’s the big picture? 


Hash tables offer faster searching than a BST, but they don’t order their elements alphabetically.

Create an array with N slots to hold your values.

Define a func f() that takes a value V as an input and produces a slot # from 0 to N-1 as its output.

[Figure: names such as “Suzie”, “Frances”, “Leslie” and “Natasha” being mapped by f() to slots.]

Use f() to find the right slot # to either add a new item, or search for an existing item.

Occasionally, f() will map several items to the same slot. You can either place the 2nd item in ...


The Modulus Operator 


In C++, the % operator is used to divide two numbers and obtain the remainder.

For example, if we compute:

    int x = 1234 % 100;

the value of x will be 34 (1234 / 100 = 12 R 34).

Now, as it turns out, the modulo operator has an interesting property!


Let’s see if you can 
figure out what it is... 


The Modulus Operator

Let’s modulus-divide a bunch of numbers by 5 and see what happens:

10 % 5 = 0
11 % 5 = 1
...

What do you notice?

When we divide numbers by 5, all of the remainders are less than 5 (between 0-4)!

Let’s try again with 3 for fun!

When we divide numbers by 3, all of the remainders are less than 3 (between 0-2)!

And as you’d guess, if you divided a bunch of numbers by 100,000, the remainders would all be less than 100,000 (between 0-99,999)!

Let's just store that interesting fact away in your brain for later...

Rule: When you divide by a given value N, all of your remainders are guaranteed to be between 0 and N-1!
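The rule above can be checked with a few lines of C++. This is only a sketch: remaindersOf is a made-up helper name, not something from the lecture, and it assumes the values are non-negative (in C++, a negative left operand can give a negative remainder).

```cpp
#include <vector>

// Hypothetical helper (not from the lecture): collects v % n for each value v.
// Assumes every v is non-negative, so each result lands between 0 and n-1.
std::vector<int> remaindersOf(const std::vector<int>& values, int n)
{
    std::vector<int> result;
    for (int v : values)
        result.push_back(v % n);
    return result;
}
```

For example, remaindersOf({10, 11, 12, 13, 14}, 5) yields 0, 1, 2, 3, 4, matching the list on the slide.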


The “Hash Table” 


OK... So far, what’s the most efficient ADT 
we know of to insert and search for data? 


Right! The Binary Search Tree - it gives us O(log2 N) performance!


Can we do any better? If so, how much better? 


Challenge: 


Build an ADT that holds a bunch of 9-digit student ID#s
such that the user can add new ID#s or
determine if the ADT holds an existing ID#
in just 1 step - not O(N) or O(log2 N) but O(1).


The (Almost) Hash Table 


How can we create an ADT where we can insert the 
9-digit student ID#s for all 50,000 UCLA students... 


and then find if our ADT holds a given ID# 


That can’t be done... can it? 


It can, and let’s see how! 


Let’s use a really, really large array to hold our #s. 


The (Almost) Hash Table

Idea: Let’s create an array with 1 billion slots - one slot for each valid ID#.

To add a new ID# with a value of N, we'll simply set array[N] to true.

To determine if our array holds a previously-added value Q, simply check if array[Q] is true!

class AlmostHashTable
{
public:
    void addItem(int n)
    {
        m_array[n] = true;
    }

    bool holdsItem(int q)
    {
        return m_array[q] == true;
    }

private:
    bool m_array[1000000000]; // big!
};

int main()
{
    AlmostHashTable x;

    x.addItem(400683948);
    if (x.holdsItem(1234) != true)
        cout << "Couldn't find it!";
}

The (Almost) Hash Table


OK - so now we know how to build an O(1) search! But what’s the problem with our ADT?

It’s really, really inefficient: Our array has 1 billion slots, yet there are only 50,000 UCLA student IDs we could possibly add to it, so we're wasting 999,950,000 of the slots...

It would be great if we could use the same algorithm but with a smaller array, say one with 100,000 slots instead of 1 billion!

The (Almost) Hash Table


[Figure: a function f(x) maps each 9-digit ID# (000,000,000 ... 024,641,083 ... 605,172,432 ... 999,999,999) to a slot in a 100,000 element array.]

Let’s call this function, f(x), a Mapping function.

By the way, the official CS lingo for a “slot” in the array is a “bucket.” So that’s what we'll call our slots from now on!

Then we track our ID# in that slot by setting it to true.

To add a new item, we can do this... And to search in one step...

class AlmostHashTable2
{
public:
    void addItem(int n)
    {
        int bucket = mapFunc(n);
        m_array[bucket] = true;
    }

    bool holdsItem(int q)
    {
        int bucket = mapFunc(q);
        return m_array[bucket] == true;
    }

private:
    int mapFunc(int idNum)
    { /* ??? */ }

    bool m_array[100000]; // not so big!
};

The Mapping Function


How can we write a mapFunc that converts our large ID# into a bucket # that falls within our 100,000 element array?

We need a function that takes an input value idNum and returns an output value between 0 and ARRAY_SIZE - 1.

int mapFunc(int idNum)
{
    const int ARRAY_SIZE = 100000;

    int bucket = idNum % ARRAY_SIZE;
    return bucket;
}

RIGHT! The C++ % operator (aka the modulus division operator) does exactly what we want!!!

So now for each input ID# we can compute a corresponding value between 0-99,999! The corresponding value can be used to pick a bucket in our 100,000 element array!


The (Almost) Hash Table

Let’s see how it works.

class AlmostHashTable2
{
public:
    void addItem(int n)
    {
        int bucket = mapFunc(n);
        m_array[bucket] = true;
    }

private:
    int mapFunc(int idNum)
    {
        return idNum % 100000;
    }

    bool m_array[100000]; // not so big!
};

int main()
{
    AlmostHashTable2 x;

    x.addItem(400683948);
    x.addItem(111105224);
    x.addItem(222205224);
}

400,683,948 % 100,000 = 83,948
The true value in slot 83,948 indicates that the value 400,683,948 is held in our ADT.

111,105,224 % 100,000 = 5,224
The true value in slot 5,224 indicates that the value 111,105,224 is held in our ADT.

The (Almost) Hash Table


Ok, let’s add the last ID# to our table:

    x.addItem(222205224);

222,205,224 % 100,000 = 5,224

But wait! We already stored a true value in bucket 5,224 to represent 111,105,224!

But our mapping function wants to also put a true value in slot 5,224 to represent 222,205,224!

This is called a collision!

Now bucket 5,224 is ambiguous! How can I tell if my hash table holds 222,205,224 or 111,105,224?

The (Almost) Hash Table: A problem!

A collision is a condition where two or more values both map to the same bucket in the array.

This causes ambiguity, and we can’t tell what value was actually stored in the array! Let’s fix this problem!
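To see the collision concretely, here is a small sketch. mapFunc is the mapping function from the slides; collides is a hypothetical helper added only for this illustration.

```cpp
// Same mapping function as the slides: ID# -> bucket # between 0 and 99,999.
int mapFunc(int idNum)
{
    const int ARRAY_SIZE = 100000;
    return idNum % ARRAY_SIZE;
}

// Hypothetical helper (not from the lecture): do two ID#s land in the same bucket?
bool collides(int idA, int idB)
{
    return mapFunc(idA) == mapFunc(idB);
}
```

With these, collides(111105224, 222205224) is true (both map to bucket 5,224), while collides(400683948, 111105224) is false.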


REAL Hash Tables 


There are many schemes for dealing with 
collisions, and today we'll learn two of the most 
popular... 


The Closed Hash Table with “Linear Probing”
The “Open Hash Table”

Closed Hash Table with Linear Probing: Insertion


Linear Probing Insertion:

As before, we use our mapping function to locate the right bucket in our array.

If the target bucket is empty, we can store our value there.

However, instead of storing true in the bucket, we store our full original value - this prevents ambiguity!

If the bucket is occupied, scan down from that bucket until we hit the first open bucket. Put the new value there.

Callouts from the walkthrough (inserting 222,205,224):
“This bucket is occupied, so we can’t put our value here!”
“This bucket is currently empty, so we can put our new value here.”

Closed Hash Table with Linear Probing: Insertion


Linear Probing Insertion:

Sometimes, you'll need to insert an item near the end of the table...

For instance, let’s say we want to insert a new value of 640,099,998 into our hash table.

If you run into a collision on the last bucket, and go past the end... You simply wrap back around to the top!

Callouts from the walkthrough:
“Whoops! I’ve gone past the end of the table!”
“Woot! Finally a free spot!”


Closed Hash Table with Linear Probing: Searching

Linear Probing Searching:

To search our hash table, we use a similar approach.

We compute a target bucket number with our mapping function.

We then look in that bucket for our value. If we find it, great!

If we don’t find our value, we probe linearly down the array until we either find our value or hit an empty bucket.

If while probing, you run into an empty bucket, it means: your value isn't in the array!

And as before, if you end up searching past the end, just wrap back up to the top!

Callouts from the walkthrough (searching for 111,105,224):
“Hmm, this bucket doesn’t have my value... I'll keep looking for it until I hit an ...”
“Hmmm. I didn’t find my value and I ran into an empty bucket. My value must not be in the array!”


Closed Hash Table with Linear Probing

This approach addresses collisions by putting each value as close as possible to its intended bucket.

Since we store every original value (e.g., 111,105,224) in the array, there is no chance of ambiguity.


Closed Hash Table with Linear Probing

So why do we call this a “Closed” hash table???

Since our data is stored in a fixed-size array, there are a fixed (closed) number of buckets for us to put values.

Once we run out of empty buckets, we can’t add new values...

Linked lists and binary search trees don’t have this problem!


Ok, let’s see the C++ code now! 


Linear Probing Hash Table: The Details

In a Linear Probing Hash Table, each bucket in the array holds two things:

1. A variable to hold your value (e.g., an ID#)
2. A “used” field that indicates whether this bucket in the hash table has been filled.

If this field is false, it means that this Bucket in the array is empty.

If the field is true, then it means this Bucket is already filled with valid data.

struct BUCKET
{
    // a bucket stores a value (e.g., an ID#)
    int idNum;
    bool used; // is bucket in-use?
};

const int NUM_BUCK = 10;

class HashTable
{
public:
    void insert(int idNum)
    {
        int bucket = mapFunc(idNum);

        for (int tries = 0; tries < NUM_BUCK; tries++)
        {
            if (m_buckets[bucket].used == false)
            {
                m_buckets[bucket].idNum = idNum;
                m_buckets[bucket].used = true;
                return;
            }

            bucket = (bucket + 1) % NUM_BUCK;
        }

        // no room left in hash table!!!
    }

private:
    int mapFunc(int idNum) const
    { return idNum % NUM_BUCK; }

    BUCKET m_buckets[NUM_BUCK];
};

Notes on the code:

m_buckets is our array of slots, aka “buckets.”

mapFunc picks the starting bucket by dividing the ID# by the number of buckets and then taking the remainder (%).

If the current bucket is already occupied by an item, advance to the next bucket (wrapping around from slot 9 back to slot 0 when we hit the end).

Linear Probing: Inserting

main()
{
    HashTable ht;

    ht.insert(29);
    ht.insert(65);
    ht.insert(79);
}

When ht is created, every bucket has its “used” field initialized to false. This indicates that all of the buckets are empty.

ht.insert(29): bucket = 29 % 10 = 9. Our bucket is currently empty, so there’s room here for our new item! We store idNum 29 in bucket 9 and set used to true.

ht.insert(65): bucket = 65 % 10 = 5. This bucket is empty too, so 65 goes in bucket 5.

When a bucket is already occupied, we keep looking for an empty slot by advancing to the next bucket (wrapping around the end):

    bucket = (bucket + 1) % NUM_BUCK;

This is the same as:

    bucket = bucket + 1;
    if (bucket == NUM_BUCK)
        bucket = 0;

ht.insert(79): bucket = 79 % 10 = 9, but bucket 9 already holds 29, so we advance and wrap around to bucket 0. Our new bucket is empty! There’s room here for our new item!


const int NUM_BUCK = 10;

class HashTable
{
public:
    bool search(int idNum)
    {
        int bucket = mapFunc(idNum);

        for (int tries = 0; tries < NUM_BUCK; tries++)
        {
            if (m_buckets[bucket].used == false)
                return false;

            if (m_buckets[bucket].idNum == idNum)
                return true;

            bucket = (bucket + 1) % NUM_BUCK;
        }

        return false; // not in the hash table
    }

private:
    int mapFunc(int idNum) const
    { return idNum % NUM_BUCK; }

    BUCKET m_buckets[NUM_BUCK];
};

Notes on the code:

If we run into an empty bucket (and haven't yet found our item) then we know our item is not in the hash table.

If the bucket holds the value we're searching for, then we've found our item and we're done.

If we went through every bucket and didn’t find our item, then it’s not in the hash table! Tell the user.

HashTable ht;
...
bool x;

x = ht.search(29);
x = ht.search(175);
x = ht.search(20);

ht.search(29): bucket = 29 % 10 = 9. This bucket is in use. The bucket holds a value of 29, which matches the value we're searching for, so we return true.

ht.search(175): bucket = 175 % 10 = 5. The bucket is not empty, but it doesn't hold 175. Keep looking! We advance bucket by bucket until we reach the bucket that holds the value (175) we were looking for, and return true.

ht.search(20): bucket = 20 % 10 = 0. Nope. We're looking for 20, but this bucket holds a different value. We haven’t found our item yet, but there's still a chance since we haven't run into an empty slot. Keep looking! Eventually we reach an empty bucket. This means that the value (20) we're searching for can’t possibly be in the table. If it were in the table, we’d have already found it before hitting an empty slot! So we return false.
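Putting the insert and search slides together, a compilable sketch of the closed hash table might look like the following. It follows the slide code closely, with two assumptions made for this illustration: the bucket fields are given default values so every bucket starts out unused, and insert returns a bool (the slide version returns void) so a full table can be detected.

```cpp
const int NUM_BUCK = 10;

struct Bucket
{
    int  idNum = 0;      // the stored value (e.g., an ID#)
    bool used  = false;  // is bucket in-use? (starts out empty)
};

class HashTable
{
public:
    // Assumption: returns false when there is no room left in the table.
    bool insert(int idNum)
    {
        int bucket = mapFunc(idNum);
        for (int tries = 0; tries < NUM_BUCK; tries++)
        {
            if (!m_buckets[bucket].used)
            {
                m_buckets[bucket].idNum = idNum;
                m_buckets[bucket].used = true;
                return true;
            }
            bucket = (bucket + 1) % NUM_BUCK;  // advance, wrapping past the end
        }
        return false;  // no room left in hash table
    }

    bool search(int idNum) const
    {
        int bucket = mapFunc(idNum);
        for (int tries = 0; tries < NUM_BUCK; tries++)
        {
            if (!m_buckets[bucket].used)
                return false;                  // hit an empty bucket: not here
            if (m_buckets[bucket].idNum == idNum)
                return true;                   // found it
            bucket = (bucket + 1) % NUM_BUCK;
        }
        return false;  // checked every bucket: not in the hash table
    }

private:
    int mapFunc(int idNum) const { return idNum % NUM_BUCK; }

    Bucket m_buckets[NUM_BUCK];
};
```

After inserting 29, 65 and 79, searching for 29 or 79 succeeds (79 wrapped around to bucket 0), while searching for 20 probes bucket 0, then hits the empty bucket 1 and fails.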


What Can you Store in your Hash Table?

Oh, and if you like, you can include additional associated values (e.g., a name, GPA) in each bucket (see struct Bucket)!

For instance, what if I want to also store the student’s name and GPA in each bucket along with their ID#? You can do that!

bool search(int idNum, string &name)
{
    int bucket = mapFunc(idNum);
    for (int tries = 0; tries < NUM_BUCK; tries++)
    {
        if (m_buckets[bucket].used == false)
            return false;

        if (m_buckets[bucket].idNum == idNum)
        {
            name = m_buckets[bucket].name;
            return true;
        }

        bucket = (bucket + 1) % NUM_BUCK;
    }
    return false; // not in the hash table
}

Even though we choose our bucket # based on the ID#... We can store as many other associated field values in the bucket as we like!

[Figure: a 10-bucket table in which bucket 5 is now empty (used f), bucket 7 holds 175 and bucket 9 holds 29.]

Suppose we now search for 15:

    bucket = 15 % NUM_BUCK
    bucket = 15 % 10
    bucket = 5

Wait a second, this bucket is empty!

But in fact, the value of 15 is in our table - in fact, it’s in the next slot down!

So a closed hash table works best when you won't need to delete items from your hash table.

Like if you’re building a hash table that holds words for a dictionary... You'll just add words, never delete any, right?


The “Open Hash Table” 


We just saw how to use linear probing to deal with 
collisions in our closed hash table. 


Our closed hash table + linear probing works just 
fine, but it still has a few problems: 
It’s difficult to delete items 


It has a cap on the number 
of items it can hold... That’s a bummer. 


It'd be nice if we could find a way to avoid both of 
these problems, yet still have an O(1) table! 


We can! And it’s called the “Open Hash Table.”
Let’s see how it works!

The “Open Hash Table”

[Figure: an array of buckets, numbered from 0, in which each bucket points to a linked list of values.]

Insert the following values: ...

Cool! Since the linked list in each bucket can hold an unlimited number of values... Our open hash table is not size-limited like our closed one!

How about searching our Open hash table?

Search the linked list at array[bucket] for your item. If we reach the end of the list without finding our item, it’s not in the table!

The “Open Hash Table”

Question: How do you delete an item from an open hash table?

Answer: You just remove the value from the linked list.

Let’s delete the student ...

If you plan to repeatedly insert and delete values into the hash table, then the Open table is your best bet!

Also, you can insert more than N items into your table and still have great performance!

Oh - and there’s no reason why we have to use a linked-list to deal with collisions...

Hash Table Efficiency 


Question: How efficient is the hash table ADT? 
How long does it take to locate an item? 
How long does it take to insert an item? 
Answer: 
It depends upon: 
(a) The type of hash table (e.g., closed vs. open), 
(b) how full your hash table is, and 


(c) how many collisions you have in the hash table. 


Closed Hash Table Efficiency

Let’s assume we have a completely (or nearly) empty hash table.

What's the maximum number of steps required to insert a new value?

Right! There’s zero chance of collision, so we can add our new value in one step!

And finding an item in a nearly-empty hash table is just as fast! For example:

    bucket = convert(12);
    bucket = 2

We have no collisions, so either we find an item right away or we know it’s not in the hash table..

[Figure: a 10-bucket table (0-9) whose buckets each hold idNum, GPA, Name, etc.; empty buckets have idNum: -1.]


Closed Hash Table Efficiency

Ok, but what if our hash table is nearly full?

What’s the maximum number of steps required to insert a new value?

Right! It could take up to N steps!

And searching can take just as long in the worst case...

So technically, a hash table can be up to O(N) when it’s nearly full!

So how big must we make our hash table so it runs quickly? To figure this out, we first need to learn about the “load” concept.

[Figure: a nearly full 10-bucket table of student records (idNum, Name, GPA, etc.) for students such as Tad, Abe, Ben, Liz, Al, Jill, Hoa, Bill and Nat, with almost every bucket in use.]


Hash Table Efficiency: The Load Factor 


The “load” of a hash table is the maximum number of values you intend to add divided by the number of buckets in the array.

    L = (Max # of values to insert) / (Total buckets in the array)

Example: A load of L=.1 means your array has 10X more buckets than you need (you'll only fill 10% of the buckets).

Example: A load of L=.9 means your array has about 10% more buckets than you need (you'll fill 90% of the buckets).

Closed Hash w/Linear Probing Efficiency

Given a particular load L for a Closed Hash Table w/Linear Probing, it's easy to compute the average # of tries it'll take you to insert/find an item:

    Average # of Tries = ½(1 + 1/(1-L)) for L < 1.0

So, if your closed hash table has a load factor of...      your search will take

    .10 (your array is 10x bigger than required)           ~1.05 searches
    .20 (your array is 5x bigger than required)            ~1.12 searches
    .30 (your array is about 3.3x bigger than required)    ~1.21 searches
    .70 (your array is about 1.4x bigger than required)    ~2.16 searches
    .80 (your array is 1.25x bigger than required)         ~3.00 searches
    .90 (your array is about 1.1x bigger than required)    ~5.50 searches

Open Hash Table Efficiency

Given a particular load L for an Open Hash Table, it’s also easy to compute the average # of tries to insert/find an item:

    Average # of Checks = 1 + L/2

So, if your open hash table has a load factor of...        your search will take

    .10 (your array is 10x bigger than required)           ~1.05 searches
    .20 (your array is 5x bigger than required)            ~1.10 searches
    .30 (your array is about 3.3x bigger than required)    ~1.15 searches
    .70 (your array is about 1.4x bigger than required)    ~1.35 searches
    .80 (your array is 1.25x bigger than required)         ~1.40 searches
    .90 (your array is about 1.1x bigger than required)    ~1.45 searches

Closed vs. Open Hash Table


[Figure: a closed hash table of student records shown next to an open hash table whose buckets 0-9 point to linked lists ending in NULL.]

    Load    Closed Hash w/L.P.    Open Hash
            Avg Steps             Avg Steps
    .10     ~1.05 searches        ~1.05 searches
    .20     ~1.12 searches        ~1.10 searches
    .30     ~1.21 searches        ~1.15 searches
    .70     ~2.16 searches        ~1.35 searches
    .80     ~3.00 searches        ~1.40 searches
    .90     ~5.50 searches        ~1.45 searches

Moral: Open hash tables are almost ALWAYS more
efficient than Closed hash tables!
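The two averages in the tables can be reproduced directly from the formulas on the slides. A minimal sketch, where closedAvgTries and openAvgChecks are names made up for this example:

```cpp
// Average # of tries for a closed hash table with linear probing,
// from the slide formula: 1/2 * (1 + 1/(1 - L)). Only valid for 0 <= L < 1.0.
double closedAvgTries(double load)
{
    return 0.5 * (1.0 + 1.0 / (1.0 - load));
}

// Average # of checks for an open hash table, from the slide formula: 1 + L/2.
double openAvgChecks(double load)
{
    return 1.0 + load / 2.0;
}
```

At a load of .9, for instance, closedAvgTries gives about 5.5 while openAvgChecks gives about 1.45, which is the gap the moral above is pointing at.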
Sizing your Hash Table

Suppose we'll insert a maximum of 1000 values into an Open Hash table, and we want to find/insert items in roughly 1.25 steps on average. How many buckets must our table have?

Part 1: Set the equation for the average # of checks equal to 1.25 and solve for L:

    1.25 = 1 + L/2, so L = .5

This result means:

“If you want to be able to find/insert items into your open hash table in an average of 1.25 steps, you need a load of .5, or roughly 2x more buckets than the maximum number of values you’ll insert.”

Part 2: Use the load formula to solve for the required hash table size:

    Required hash table size = (Max # of values to insert) / L = 1000 / .5 = 2000 buckets

If our hash table has 2000 buckets and we're inserting a maximum of 1000 values, we can expect an average of about 1.25 steps per insert/search!

So basically it’s a tradeoff!

You could always use a really big hash table with way-too-many buckets and ensure really fast searches...

But then you'll end up wasting lots of memory...

On the other hand, if you have a really small hash table (with just barely enough room), it'll be slower.

Finally, when choosing the exact size of your hash table (the number of buckets)...

Always try to choose a prime number of buckets...

Instead of 2000 buckets, give your hash table 2017 buckets.

This causes more even distribution and fewer collisions!
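The two-part sizing calculation, plus the "pick a prime" tip, can be sketched as follows. All four function names are hypothetical helpers invented for this example, and the sketch assumes an open hash table (it uses the 1 + L/2 formula).

```cpp
#include <cmath>

// Part 1: the load an open hash table needs for a target average # of checks,
// solved from: avgSteps = 1 + L/2.
double loadForOpenTable(double avgSteps)
{
    return 2.0 * (avgSteps - 1.0);
}

// Part 2: buckets needed to hold maxItems at the given load (rounded up).
int requiredBuckets(int maxItems, double load)
{
    return static_cast<int>(std::ceil(maxItems / load));
}

// Simple trial-division primality check; fine for table-sized numbers.
bool isPrime(int n)
{
    if (n < 2)
        return false;
    for (int d = 2; d * d <= n; d++)
        if (n % d == 0)
            return false;
    return true;
}

// Smallest prime that is >= n, for choosing a prime number of buckets.
int nextPrimeAtLeast(int n)
{
    while (!isPrime(n))
        n++;
    return n;
}
```

For 1000 items and a 1.25-step target this gives 2000 buckets; the next prime at or above that is 2003. The slide's 2017 is also prime, so either choice follows the tip.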
What Happens If...

How do we convert a name into a bucket number?!??!?

int mapFunc(int ID)
{
    return ID % 100000;
}

int mapFunc(string &name)
{
    int h = hash(name);
    return h % 100000;
}

Well, we need a two-step process!

A hash function is a function that takes an arbitrary input (like a string)... and converts it into a number.

First, we use a hash function to convert the name into a number. Then we convert that number into a bucket number that fits into our hash table.

A Hash Function for Strings

Here’s one possibility for a hash function that can convert a string into an integer value.

int hash(const string &name)
{
    int i, total = 0;

    for (i = 0; i < name.length(); i++)
        total = total + name[i];

    return(total);
}

Hint: What happens if we hash “BAT”? What happens if we hash “TAB”?

But this hash function isn’t so good. Why not? How can we fix it?

A Better Hash Function for Strings

Here’s a better version of our string hashing function - while not perfect, it disperses items more uniformly in the table.

int hash(const string &name)
{
    int i, total = 0;

    for (i = 0; i < name.length(); i++)
        total = total + (i+1) * name[i];

    return(total);
}

Now “BAT” and “TAB” hash to different slots in our array since this version takes character position into account.
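To compare the two string hash functions side by side, here they are under distinct names. sumHash and positionalHash are names chosen for this sketch (the slides call both of them hash), and the values quoted below assume ASCII character codes.

```cpp
#include <string>

// First version from the slides: just adds up the character codes.
int sumHash(const std::string& name)
{
    int total = 0;
    for (std::size_t i = 0; i < name.length(); i++)
        total = total + name[i];
    return total;
}

// Better version from the slides: weights each character by its position.
int positionalHash(const std::string& name)
{
    int total = 0;
    for (std::size_t i = 0; i < name.length(); i++)
        total = total + static_cast<int>(i + 1) * name[i];
    return total;
}
```

With ASCII codes, sumHash gives 215 for both “BAT” and “TAB” (a guaranteed collision), while positionalHash gives 448 for “BAT” and 412 for “TAB”.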


A GREAT Hash Function for Strings

Rather than write your own hash function, why not use the one C++ provides?

Make sure to #include <functional> to use C++’s hash function!

    std::hash<std::string> str_hash; // creates a string hasher!
    unsigned int hashValue = str_hash(hashMe); // now hash our string!

First you define a C++ string hashing object.

Then you use the object to hash your string. This returns a hash value between 0 and 4 billion.

Finally, you apply your own modulo function and return a bucket # that fits into your hash table’s array.
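The three steps above fit into one small function. This is a sketch: the function name bucketForString and the NUM_BUCK value are assumptions made for the example (the name used on the slide did not survive), while std::hash<std::string> is the standard library hasher from <functional>.

```cpp
#include <functional>
#include <string>

const int NUM_BUCK = 100000;  // assumed table size for this example

// Hypothetical name: maps a string to a bucket # between 0 and NUM_BUCK-1.
unsigned int bucketForString(const std::string& hashMe)
{
    std::hash<std::string> str_hash;           // creates a string hasher
    std::size_t hashValue = str_hash(hashMe);  // now hash our string
    // Apply our own modulo so the result fits into the table's array.
    return static_cast<unsigned int>(hashValue % NUM_BUCK);
}
```

The exact bucket number depends on the standard library implementation, but within one run of a program the same string always maps to the same bucket.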


Writing Your Own Hash Function

Great! But what if you need to write a hash function for some non-standard data type?

Like hashing...

Geospatial coordinates
An array of N numbers
The contents of a data file

This is non-trivial!

You really need to understand the “nature” of the data you're hashing...

Then design your algorithm, analyze it, and iterate.


Choosing a Hash Function: Tips

1. The hash function must always give us the same output value for a given input value:

    Today: hash(400683948) -> 83,948
    Tomorrow: hash(400683948) -> still 83,948

2. The hash function should disperse items throughout the hash array as randomly as possible.

    Not good: hash(“cba”) = 294 (any other ordering of the same letters would get the same value from a hash that just sums the characters)

3. When coming up with a new hash function, always measure how well it disperses items (do some ...)

The unordered_map: A hash-based version of a map

#include <unordered_map>
#include <iostream>
#include <string>

using namespace std;

int main()
{
    unordered_map<string,int> hm;                  // define a new U_M
    unordered_map<string,int>::iterator iter;      // define an iterator for a U_M

    hm["Carey"] = 10;                              // insert a new item into the U_M
    hm["David"] = 20;

    iter = hm.find("Carey");                       // find Carey in the hash map

    if (iter == hm.end())                          // did we find Carey or not?
        cout << "Carey was not found!";            // couldn't find "Carey" in the hash map
    else
    {
        cout << "When we look up " << iter->first; // "When we look up Carey"
        cout << " we find " << iter->second;       // "we find 10"
    }
}


Hash Tables vs. Binary Search Trees

                    Hash Tables                            Binary Search Trees
Speed               ...                                    ...
Simplicity          Easy to implement                      ...
Max Size            Closed: Limited by array size          Unlimited size
                    Open: Not limited, but high load
                    impacts performance
Space Efficiency    Wastes a lot of space if you have      Only uses as much memory as
                    a large hash table holding few items   needed (one node per item inserted)
Ordering            No ordering (random)                   Alphabetical ordering

In fact, if you want to expand your hash table’s size you basically have to create a whole new one:

1. Allocate a whole new array with more buckets
2. Rehash every value from the original table into the new table
3. Free the original table
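The three expansion steps can be sketched for an open hash table stored as a vector of linked lists. The rehash function and this table representation are assumptions made for the example, and it assumes the stored ID#s are non-negative.

```cpp
#include <list>
#include <vector>

// Hypothetical helper: builds a bigger open hash table from an existing one.
std::vector<std::list<int>> rehash(const std::vector<std::list<int>>& oldTable,
                                   std::size_t newNumBuckets)
{
    // 1. Allocate a whole new array with more buckets.
    std::vector<std::list<int>> newTable(newNumBuckets);

    // 2. Rehash every value from the original table into the new table.
    //    (Assumes non-negative ID#s, so % yields a valid bucket #.)
    for (const std::list<int>& chain : oldTable)
        for (int idNum : chain)
            newTable[idNum % newNumBuckets].push_back(idNum);

    // 3. The original table is freed by its owner once it is no longer needed.
    return newTable;
}
```

Note that values sharing a bucket in the old table usually spread out in the new one, since the bucket # depends on the number of buckets.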


Tables 


[Image caption: “EVERYTIME YOU CHANGE THE DATABASE TABLE ...”]


Tables 
What’s the big picture? 


A table is an ADT that stores a bunch of 
“records” (like student information) and then 
provides multiple ways to search for an item. 


For example, you could search for students by 
phone number, full name, and/or student ID. 


A simpler ADT like a hash table or BST only 
enables searching efficiently by a single field 


In a table ADT you store the full record (name, phone number, student ID, GPA, ...).

Then you create multiple light “indexes” (e.g., using BSTs/hash tables) to speed up searching by different field types.
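As a rough sketch of this idea, the table below keeps the full records in a vector and adds two light indexes that map a field value to a slot #. The class layout, the member names m_byName and m_byID, and the assumption that names are unique are all choices made for this illustration.

```cpp
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

struct Student
{
    std::string name;
    int         IDNum;
    float       GPA;
    std::string phone;
};

class TableOfStudents
{
public:
    void addStudent(const Student& stud)
    {
        int slot = static_cast<int>(m_students.size());
        m_students.push_back(stud);       // the full record lives here
        m_byName[stud.name] = slot;       // BST-based index (assumes unique names)
        m_byID[stud.IDNum] = slot;        // hash-based index
    }

    // Each search returns the record's slot #, or -1 if not found.
    int searchByName(const std::string& name) const
    {
        auto it = m_byName.find(name);
        return it == m_byName.end() ? -1 : it->second;
    }

    int searchByID(int idNum) const
    {
        auto it = m_byID.find(idNum);
        return it == m_byID.end() ? -1 : it->second;
    }

    const Student& at(int slot) const { return m_students[slot]; }

private:
    std::vector<Student>           m_students;
    std::map<std::string, int>     m_byName;
    std::unordered_map<int, int>   m_byID;
};
```

Each index stores only a key and a slot #, so adding another searchable field costs one more small map rather than another copy of every record.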


[Figure: a table of full student records stored in numbered slots, with a student ID BST as an index whose nodes map an ID# to a slot (e.g., ID#: 0003 -> Slot: 1, ID#: 4006 -> Slot: 5). Sample records: name: Alex, GPA: 2.05, ID: 7124; name: Linda, GPA: 3.99, ID: 0003; name: Jason; name: Carey, GPA: 3.62, ID: 4006.]

“Tables”


Let’s say you want to write a program to keep track of all your BFFs...

Of course, you want to remember all the important dirt about each BFF:

    Name: Carey Nash
    Phone number: 867-5309
    Birthday: July 28
    iPhone or ‘droid: iPhone
    Social Security #: 111222333
    Favorite food: ...

And you want to quickly be able to search for a BFF in one or more ways...

“Find all the dirt on my BFF ‘David Johansen’”

“Find all the dirt on the BFF whose number is 867-5309”

In CS lingo, a group of related data is called a “record.”

Each record has a bunch of “fields” like Name, Phone #, Birthday, etc.

    Name: John Rohr
    Phone number: 999-9191
    Birthday: Jan 1
    iPhone or ‘droid: Droid
    Social Security #: 47372727

While you may have several records with the same Name field value (e.g., John Smith) or the same Birthday field value (e.g., Jan 1st)...

Some fields, like Social Security Number, will have unique values across all records - this type of field is useful for searching and finding a unique record!

A field (like the SSN) that has unique values across all records is called a “key field.”

Our Social Security field is a “key” field since every record is guaranteed to have a unique value for this field.


Implementing Tables

How could you create a record in C++? Answer: Just use a struct or class to represent a record of data!

struct Student
{
    string name;
    int IDNum;
    float GPA;
    string phone;
    ...
};

How can you create a table in C++? Answer: You can simply create an array or vector of your struct!

vector<Student> table;

// algorithm to search by the phone field

int SearchByPhone(vector<Student> &table, string &findPhone)
{
    for (int s = 0; s < table.size(); s++)
        if (findPhone == table[s].phone)
            return(s); // the student you're looking for is in slot s

    return(-1); // didn’t find that student in your table
}

Implementing Tables


Heck, why not just create a 
whole C++ class for our table? 


struct Student 
{ 
    string name; 
    int IDNum; 
    float GPA; 
    string phone; 
}; 

class TableOfStudents 
{ 
public: 
    TableOfStudents();   // construct a new table 
    ~TableOfStudents();  // destruct our table 
    void addStudent(Student &stud);   // add a new Student 
    int searchByName(string &name);   // find a Student's slot # by name 
private: 
    vector<Student> m_students; 
}; 

void TableOfStudents::addStudent(Student &record) 
{ 
    m_students.push_back(record);  // append the record to the end of the table 
} 

int TableOfStudents::searchByName(string &name) 
{ 
    for (int s = 0; s < m_students.size(); s++) 
        if (name == m_students[s].name) 
            return( s ); // the student you're looking for is in slot s 

    return( -1 ); // didn't find that student in your table 
} 
Tables 


In the TableOfStudents class, we used a vector to hold our table 
and a linear search to find Students by their name or phone. 


This is a perfectly valid table - but it’s slow to find 
a student! How can we make it more efficient? 


Well, we could alphabetically sort 
our vector of records by their 
names... 


Then we could use a binary search 
to quickly locate a record based on 
a person’s name. 


But then every time we add a new 
record, we have to re-sort the whole 
table. Yuck! 


And if we sort by name, we can’t 
search efficiently by other fields like 
phone # or ID #! 


[Figure: a vector of student records] 
Name: David, ID #: 111222333, GPA: 2.1, Phone: 310 825- 
Name: John, ID #: 95847362, GPA: 3.8, Phone: 818 416- 
Name: Carey, ID #: 400683945, GPA: 4.0, Phone: 424 750- 
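

As a rough sketch of the sorted-vector idea, here is what a binary search by name might look like. The trimmed-down Student struct and the helper name searchSortedByName are assumptions made for illustration, not code from the lecture; the search itself relies on std::lower_bound from the standard library.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Trimmed-down record for illustration (field names follow the lecture's Student).
struct Student
{
    std::string name;
    int IDNum;
};

// Hypothetical helper: returns the slot # of the student with this name,
// or -1 if there is no such student.
// Precondition: the table is already sorted alphabetically by name.
int searchSortedByName(const std::vector<Student>& table, const std::string& name)
{
    // Debug-only check of the precondition (it is O(N), so it is for testing only).
    assert(std::is_sorted(table.begin(), table.end(),
        [](const Student& a, const Student& b) { return a.name < b.name; }));

    // Binary search: find the first record whose name is not less than `name`.
    auto it = std::lower_bound(table.begin(), table.end(), name,
        [](const Student& s, const std::string& n) { return s.name < n; });

    if (it != table.end() && it->name == name)
        return static_cast<int>(it - table.begin());
    return -1;
}
```

The lookup takes O(log N) steps, but it only works for the one field the vector is sorted by, which is exactly the limitation the slide points out.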


Tables 


Hmmm... What if we stored our records in a binary search tree 
(e.g., a map) organized by name? Would that fix things? 


[Figure: a binary search tree of student records, organized by name] 
Name: David, ID #: 111222333, GPA: 2.1, Phone: 310 825- 
Name: John, ID #: 95847362, GPA: 3.8, Phone: 818 416- 
Name: Carey, ID #: 400683945, GPA: 4.0, Phone: 424 750- 
Name: Albert, ID #: 012191928 


Well, now we can search the table efficiently by name! 


But we still can’t search efficiently by ID# or Phone #! 
Tables 


Hmmm... What if we create two tables, 
ordering the first by name and the second by ID#? 


[Figure: two binary search trees holding the same four records, the first ordered by name and the second ordered by ID#] 
Name: Albert, ID #: 012191928, GPA: 1.5, Phone: 626 599-5939 
Name: Carey, ID #: 400683945, GPA: 4.0, Phone: 424 750-7519 
Name: David, ID #: 111222333, GPA: 2.1, Phone: 310 825-1234 
Name: John, ID #: 95847362, GPA: 3.8, Phone: 818 416-0355 


That works... Now I can quickly find people by name or ID#! 


But now we have two copies of every record, one in each 
tree! 


If the records are big, that’s a waste of space! 
So what can we do? Let’s see! 
Making an Efficient Table 


1. We'll still use a vector to store all of our records... 


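2. Let’s also add secondary data structures that map each searchable 
field value (name, ID#, phone #) to the slot # of its record in the vector. 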


These secondary data structures 
are called “indexes.” 


Each index lets us efficiently find 
a record based on a particular 
field. 


We may have as many indexes 
as we need for our application. 


class TableOfStudents 
{ 
public: 
    ... 
    int searchByName(string &name); 
private: 
    vector<Student> m_students; 
    map<string,int> m_nameToSlot; 
    map<int,int> m_idToSlot; 
    map<int,int> m_phoneToSlot; 
}; 


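[Figure: the m_students vector and its ID# index. Each index node maps a 
field value to a slot #, e.g., ID# 9876 -> slot 3, ID# 4006 -> slot 5, 
ID# 0003 -> slot 1.] 


m_students 
Slot 0: name: Alex, GPA: 2.05, ID: 7124 
Slot 1: name: Linda, GPA: 3.99, ID: 0003 
Slot 2: name: Jason, GPA: 1.55, ID: 1054 
Slot 3: name: Abe, GPA: 4.00, ID: 9876 
Slot 4: name: Zelda, GPA: 3.43, ID: 6416 
Slot 5: name: Carey, GPA: 3.62, ID: 4006 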

Making an Efficient Table 


So what does our addStudent 
method look like now? 


void TableOfStudents::addStudent(Student &stud) 
{ 
    m_students.push_back(stud);         // add the record to our vector 
    int slot = m_students.size()-1;     // get slot # of new record 
    m_nameToSlot[stud.name] = slot;     // maps name to slot # 
    m_idToSlot[stud.IDNum] = slot;      // maps ID# to slot # 
} 


Finally, every time we add a record, 
we've also got to add the ID# to slot 
# mapping to our second map! 


    vector<Student> m_students; 
    map<string,int> m_nameToSlot; 
    map<int,int> m_idToSlot; 


[Figure: the m_students vector with its name and ID# indexes, 
e.g., ID# 4006 -> slot 5, ID# 9876 -> slot 3, ID# 0003 -> slot 1.] 
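

Putting the pieces together, a minimal self-contained sketch of the indexed table might look like the following. The searchByID and getStudent methods and the const-reference signatures are assumptions added for illustration; the member names follow the slides.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct Student
{
    std::string name;
    int IDNum;
    float GPA;
};

class TableOfStudents
{
public:
    // Append the record, then register its slot # in each index.
    void addStudent(const Student& stud)
    {
        m_students.push_back(stud);
        int slot = static_cast<int>(m_students.size()) - 1;
        m_nameToSlot[stud.name] = slot;   // maps name to slot #
        m_idToSlot[stud.IDNum] = slot;    // maps ID# to slot #
    }

    // Each search is a map lookup (O(log N)) instead of a linear scan.
    int searchByName(const std::string& name) const
    {
        auto it = m_nameToSlot.find(name);
        return it == m_nameToSlot.end() ? -1 : it->second;
    }

    int searchByID(int id) const
    {
        auto it = m_idToSlot.find(id);
        return it == m_idToSlot.end() ? -1 : it->second;
    }

    // Fetch the full record once an index has given us its slot #.
    const Student& getStudent(int slot) const
    {
        assert(slot >= 0 && slot < static_cast<int>(m_students.size()));
        return m_students[slot];
    }

private:
    std::vector<Student> m_students;          // the one and only copy of each record
    std::map<std::string, int> m_nameToSlot;  // index: name -> slot #
    std::map<int, int> m_idToSlot;            // index: ID#  -> slot #
};
```

Note that a name is not a key field: with a plain map, adding a second student with the same name would overwrite the first one's index entry. A multimap would be one way around that.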


So to review, what do we have to do 
to insert a new record into our table? 


Let’s add: Wendy, ID=1000, GPA: 3.9 


[Figure: the m_students vector (slots 0-5: Alex, Linda, Jason, Abe, Zelda, 
Carey) with its name index and ID# index. The new record is appended to 
the vector and then added to each index.] 


But wait!!!! - Any time you 
delete a record or update a 
record’s searchable fields, you 
also have to update your 
indexes! 
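

To make the "update your indexes" warning concrete, here is a small sketch of renaming a student while keeping a name index in sync. The free function renameStudent and the two-field Student struct are hypothetical, and the sketch assumes names are unique within the table.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct Student
{
    std::string name;
    int IDNum;
};

// Hypothetical helper: change the name of the student in `slot`
// and update the name index to match.
// Assumes every name appears at most once in the table.
void renameStudent(std::vector<Student>& students,
                   std::map<std::string, int>& nameToSlot,
                   int slot, const std::string& newName)
{
    assert(slot >= 0 && slot < static_cast<int>(students.size()));

    nameToSlot.erase(students[slot].name);  // remove the stale index entry
    students[slot].name = newName;          // update the record itself
    nameToSlot[newName] = slot;             // index the new value
}
```

If the erase step were skipped, a search for the old name would still return this slot even though the record no longer has that name.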
Tables 


As it turns out, databases like “Oracle” use 
exactly this approach to store and index data! 


(The only difference is they usually store their 
data on disk rather than in memory) 


And by the way... While my example used 
binary search trees to index our table’s 
fields... 
You could use any efficient data structure you like! 


For example, you could use a hash table! 




Using Hashing to Speed Up Tables 


Moral: You need to understand how your 
table will be used to determine how to 
best index each field. 


For example: 


I’d use a BST for the name field so I can 
print people’s names in alphabetical order. 


I’d use a hash table for the phone # field 
because I just need to search quickly but I 
don’t need to order records by their phone #. 
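

One possible way to mix the two kinds of index is sketched below: an ordered std::map for names and a hash-based std::unordered_map for phone numbers. The class name IndexedTable, its method names, and the choice to key the phone index by string are all assumptions for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

struct Student
{
    std::string name;
    std::string phone;
};

class IndexedTable
{
public:
    void add(const Student& s)
    {
        m_students.push_back(s);
        int slot = static_cast<int>(m_students.size()) - 1;
        m_nameToSlot[s.name] = slot;
        m_phoneToSlot[s.phone] = slot;
    }

    // Hash lookup: O(1) on average, but the index has no useful ordering.
    int searchByPhone(const std::string& phone) const
    {
        auto it = m_phoneToSlot.find(phone);
        return it == m_phoneToSlot.end() ? -1 : it->second;
    }

    // Iterating a std::map visits its keys in sorted (alphabetical) order.
    std::vector<std::string> namesInOrder() const
    {
        std::vector<std::string> names;
        for (const auto& entry : m_nameToSlot)
            names.push_back(entry.first);
        return names;
    }

    const Student& get(int slot) const
    {
        assert(slot >= 0 && slot < static_cast<int>(m_students.size()));
        return m_students[slot];
    }

private:
    std::vector<Student> m_students;
    std::map<std::string, int> m_nameToSlot;             // BST-based: ordered
    std::unordered_map<std::string, int> m_phoneToSlot;  // hash-based: unordered
};
```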


Challenges 


Question: What is the big-oh of traversing all of the 
elements in a hash table? 


Question: I have two hash tables: the first has 10 
buckets, and the second has 20 buckets. If I insert 
each of the following IDs into each hash table, where 
will each ID number end up (which bucket #s)? 


ID = 5 
ID = 15 
ID = 25 
ID = 100 
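

One way to check your answer is with a tiny helper like the one below. It assumes the hash function is simply the ID modulo the number of buckets, as in the modulus discussion earlier in the lecture; the function name bucketFor is hypothetical.

```cpp
#include <cassert>

// Assumed hash function: bucket # = ID % number of buckets.
// The result is always between 0 and numBuckets-1 for non-negative IDs.
int bucketFor(int id, int numBuckets)
{
    assert(numBuckets > 0);
    return id % numBuckets;
}
```

Under that assumption, 15 and 25 collide in bucket 5 of the 10-bucket table, but land in different buckets (15 and 5) in the 20-bucket table.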


Question: How can you print out the items in a hash 
table in alphabetical/numerical order? 
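

One possible approach, sketched here with std::unordered_set standing in for the hash table, is to copy the items into a vector and sort the copy. The function name sortedItems is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <unordered_set>
#include <vector>

// Returns the hash table's items in ascending numerical order.
// The hash table itself stays unordered; only the copy is sorted.
std::vector<int> sortedItems(const std::unordered_set<int>& table)
{
    std::vector<int> items(table.begin(), table.end());  // O(N) copy, arbitrary order
    std::sort(items.begin(), items.end());               // O(N log N) sort
    assert(std::is_sorted(items.begin(), items.end()));  // sanity check
    return items;
}
```

The extra sort is the price of using a hash table: a BST-based index would hand back its items in order directly.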


